Abstract

Previous biological models of object recognition in cortex have been evaluated using
idealized scenes and have hard-coded features, such as the HMAX model by Riesenhuber
and Poggio [10]. Because HMAX uses the same set of features for all object
classes, it does not perform well in the task of detecting a target object in clutter.
This thesis presents a new model that integrates learning of object-specific features
with HMAX. The new model performs better than the standard HMAX and comparably
to a computer vision system on face detection. Results from experiments
with unsupervised learning of features and the use of a biologically plausible classifier
are presented.
1-1 The HMAX model. The first layer, S1, consists of filters tuned to
different areas of the visual field, orientations (oriented bars at 0, 45,
90, and 135 degrees) and scales. These filters are analogous to the
simple cell receptive fields found in the V1 area of the brain. The C1
layer responses are obtained by performing a max pooling operation
over S1 filters that are tuned to the same orientation, but different
scales and positions over some neighborhood. In the S2 layer, the
simple features from the C1 layer (the 4 bar orientations) are combined
into 2 by 2 arrangements to form 256 intermediate feature detectors.
Each C2 layer unit takes the max over all S2 units differing in position
and scale for a specific feature and feeds its output into the view-tuned
units. In our new model, we replace the hard-coded 256 intermediate
features at the S2 level with features the system learns.


2-1 Typical stimuli used in our experiments. From left to right: Training
faces and non-faces, “cluttered (test) faces”, “difficult (test) faces” and
test non-faces.
2-2 Typical stimuli and associated responses of the C1 complex cells (4
orientations). Top: Sample synthetic face, cluttered face, real face,
non-faces. Bottom: The corresponding C1 activations to those images.
Each of the four subfigures in the C1 activation figures maps to one of the
four bar orientations (clockwise from top left: 0, 45, 135, 90 degrees).
For simplicity, only the response at one scale is displayed. Note that
an individual C1 cell is not particularly selective either to face or to
non-face stimuli.


2-3 Sketch of the HMAX model with feature learning: Patterns on the model
“retina” are first filtered through a continuous layer S1 (simplified on
the sketch) of overlapping simple cell-like receptive fields (first derivative
of Gaussians) at different scales and orientations. Neighboring S1
cells in turn are pooled by C1 cells through a max operation. The next
S2 layer contains the RBF-like units that are tuned to object parts and
compute a function of the distance between the input units and the
stored prototypes (p = 4 in the example). On top of the system, C2
cells perform a max operation over the whole visual field and provide
the final encoding of the stimulus, constituting the input to the classifier.
The difference to standard HMAX lies in the connectivity from the
C1→S2 layer: While in standard HMAX these connections are hardwired
to produce 256 2 x 2 combinations of C1 inputs, they are now
learned from the data. (Figure adapted from [12])
2-4 Comparison between the new extended model using object-specific
learned features (p = 5, n = 480, m = 120, corresponding to the
best set of features), the machine vision face detection system and the
standard HMAX. Top: Detailed performances (ROC area) on (i) “all
faces”, (ii) “cluttered faces” only and (iii) “real faces” only (non-face
images remain unchanged in the ROC calculation). For information,
the false positive rate at 90% true positive is given in parentheses. The
new model generalizes well on all sets and overall outperforms the “AI”
system (especially on the “cluttered” set) as well as standard HMAX.
Bottom: ROC curves for each system on the test set including “all faces”.

2-5 Average C2 activation of synthetic test face and test non-face set. Left:
using standard HMAX features. Right: using features learned from
synthetic faces.

2-6 Performance (ROC area) of features learned from synthetic faces with
respect to number of learned features n and p (fixed m = 100). Performance
increases with the number of learned features up to a certain level
and then levels off. Top left: system performance on synthetic test set. Top
right: system performance on cluttered test set. Bottom: performance
on real test set.

2-7 Performance (ROC area) with respect to % face area covered and p.
Intermediate size features performed best on synthetic and cluttered
sets, small features performed best on real faces. Top left: system
performance on synthetic test set. Top right: system performance on
cluttered test set. Bottom: performance on real test set.
left to right): Sample synthetic face, C1 activation of face at band 1,
band 2, band 3, and band 4. Bottom: Sample non-faces, C1 activation
of non-face at band 1, band 2, band 3, and band 4. Each of the four
subfigures in the C1 activation figures maps to one of the four bar orientations
(clockwise from top left: 0, 45, 135, 90 degrees).

3-2 Example images of rescaled faces. From left to right: training scale,

3-3 ROC area vs. log of rescale factor. Trained on synthetic faces, tested
Chapter 1


Introduction 


Detecting a pedestrian in your view while driving. Classifying an animal as a cat
or a dog. Recognizing a familiar face in a crowd. These are all examples of object
recognition at work. A system that performs object recognition is solving a difficult
computational problem. There is high variability in appearance between objects
within the same class and variability in viewing conditions for a specific object. The
system must be able to detect the presence of an object (for example, a face) under
different illuminations, scales, and views, while distinguishing it from background clutter
and other classes.

The primate visual system seems to perform object recognition effortlessly while
computer vision systems still lag behind in performance. How does the primate visual
system manage to work both quickly and with high accuracy? Evidence from experiments
with primates indicates that the ventral visual pathway, the neural pathway
for initial object recognition processing, has a hierarchical, feed-forward architecture
[11]. Several biological models have been proposed to interpret the findings from
these experiments. One such computational model of object recognition in cortex is
HMAX. HMAX models the ventral visual pathway, from the primary visual cortex
(V1), the first visual area in the cortex, to the inferotemporal cortex, an area of the
brain shown to be critical to object recognition [5]. The HMAX model architecture
is based on experimental results on the primate visual cortex, and therefore can be
used to make testable predictions about the visual system.
While HMAX performs well for paperclip-like objects [10], the hard-coded features
do not generalize well to natural images and clutter (see Chapter 2). In this thesis
we build upon HMAX by adding object-specific features and apply the new model to
the task of face detection. We evaluate the properties of the new model and compare
its performance to the original HMAX model and machine vision systems. Further
extensions were made to the architecture to explore unsupervised learning of features
and the use of a biologically plausible classifier.

1.1  Related  Work 

Object recognition can be viewed as a learning problem. The system is first trained
on example images of the target object class and other objects, learning to distinguish
between them. Then, given new images, the system can detect the presence of the
target object class.

Two main design choices distinguish one object recognition system from another.
The first is the set of features the system uses to represent object classes. These
features can be generic, usable for any class, or class-specific. The second is the
classifier, the module that determines whether an object is from the target class or
not, after being trained on labeled examples. In this section, I will review previous
computer vision and biologically motivated object recognition systems with different
approaches to feature representation and classification.

1.1.1  Computer  Vision 

An example of a system that uses generic features is described in [8]. The system
represents object classes in terms of local oriented multi-scale intensity differences
between adjacent regions in the images and is trained using a support vector machine
(SVM) classifier. An SVM is an algorithm that finds the optimal separating hyperplane
between two classes [17]. SVMs can be used for both linearly separable and
non-separable data sets. For linearly separable data, a linear SVM is used, and the
best separating hyperplane is found in the feature space. For non-separable cases, a
non-linear SVM is used: the feature space is first transformed by a kernel function
into a high-dimensional space, where the optimal hyperplane is found.

In contrast, [2] describes a component-based face detection system that uses class-specific
features. The system automatically learns components by growing image
parts from initial seed regions until error in detection is minimized. From these image
parts, components are chosen to represent faces. In this system, the image parts
and their geometric arrangement are used to train a two-level SVM. The first level
of classification consists of component experts that detect the presence of the components.
The second level classifies the image based on the components categorized
in the first level and their positions in the image.

Another object recognition system that uses fragments from images as features is
[15]. This system uses feature selection on the feature set, a technique we will explore
in a later chapter. Ullman and Sali choose fragments from training images that
maximize the mutual information between the fragment and the class it represents.
During classification, the system first searches the test image at each location for the
presence of the stored fragments. In the second stage, each location is associated with
a magnitude M, a weighted sum of the fragments found at that location. For each
candidate location, the system verifies that (1) the fragments are from a sufficient
subset of the stored fragments and (2) the positions of the fragments are consistent with
each other (e.g. for detecting an upright face, the mouth fragment should be located
below the nose). Based on the magnitude and the verification, the system decides
whether or not the target class is present at a candidate location.
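
As a rough illustration of the selection criterion, the sketch below computes the mutual information between a binary "fragment detected" indicator and a binary class label over a set of training images. Treating detection as binary and estimating probabilities by simple counting are assumptions made here for illustration; the function name and the toy data are hypothetical and are not taken from [15].

```python
import math


def mutual_information(fragment_present, is_target):
    """Mutual information (in bits) between two binary sequences.

    fragment_present[i] is 1 if the fragment was detected in image i,
    is_target[i] is 1 if image i belongs to the target class.
    """
    n = len(fragment_present)
    mi = 0.0
    for f in (0, 1):
        for c in (0, 1):
            joint = sum(1 for a, b in zip(fragment_present, is_target)
                        if a == f and b == c) / n
            p_f = sum(1 for a in fragment_present if a == f) / n
            p_c = sum(1 for b in is_target if b == c) / n
            if joint > 0:  # terms with zero joint probability contribute nothing
                mi += joint * math.log2(joint / (p_f * p_c))
    return mi


# Toy data: four images, the first two are from the target class.
labels = [1, 1, 0, 0]
informative = mutual_information([1, 1, 0, 0], labels)    # fragment tracks the class
uninformative = mutual_information([1, 0, 1, 0], labels)  # fragment is independent of it
```

Under this criterion a fragment that appears only in target images scores higher than one that appears equally often in both classes, so it would be preferred during selection.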

1.1.2  Biological  Vision 

The primate visual system has a hierarchical structure, building up from simple to
more complex units. Processing in the visual system starts in the primary visual
cortex (V1), where simple cells respond optimally to an edge at a particular location
and orientation. As one travels further along the visual pathway to higher order visual
areas of the cortex, cells have increasing receptive field size as well as increasing
complexity. The last purely visual area in the cortex is the inferotemporal cortex
(IT). In results presented in [4], neurons were found in monkey IT that were tuned to
specific views of training objects for an object recognition task. In addition, neurons
were found that were scale, translation, and rotation invariant to some degree. These
results motivated the following view-based object recognition systems.


SEEMORE 

SEEMORE is a biologically inspired visual object recognition system [6]. SEEMORE
uses a set of receptive-field-like feature channels to encode objects. Each feature
channel F_i is sensitive to color, angles, blobs, contours or texture. The activity of F_i
can be estimated as the number of occurrences of that feature in the image. The sum
of occurrences is taken over various parameters such as position and scale depending
on the feature type.

The training and test sets for SEEMORE are color video images of 3D rigid and
non-rigid objects. The training set consists of several views of each object alone, varying
in view angle and scale. For testing, the system has to recognize novel views of
the objects presented alone on a blank background or degraded. Five possible degradations
are applied to the test views: scrambling the image, adding occlusion, adding
another object, changing the color, or adding noise. The system uses nearest-neighbor
for classification. The distance between two views is calculated as the weighted city-block
distance between their feature vectors. The training view that has the least
distance from a test view is considered the best match.
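
The nearest-neighbor rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not SEEMORE's implementation: the function names, the three-channel feature vectors, and the channel weights are all hypothetical.

```python
def weighted_city_block(a, b, weights):
    """Weighted city-block (L1) distance between two feature vectors."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))


def nearest_view(test_vec, training_views, weights):
    """Return the label of the training view closest to test_vec."""
    best_label, _ = min(
        ((label, weighted_city_block(test_vec, vec, weights))
         for label, vec in training_views),
        key=lambda pair: pair[1],
    )
    return best_label


# Hypothetical feature-channel activities for three stored training views.
training_views = [
    ("mug, view 1", [3.0, 0.0, 1.0]),
    ("mug, view 2", [2.0, 1.0, 1.0]),
    ("shoe, view 1", [0.0, 4.0, 2.0]),
]
weights = [1.0, 0.5, 2.0]  # illustrative channel weights, not taken from [6]

best = nearest_view([2.8, 0.2, 1.0], training_views, weights)
```

Here the test vector is matched to the stored view with the smallest weighted distance, which in this toy data is the first mug view.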

Although SEEMORE has some qualities similar to biological visual systems, such
as the use of receptive-field-like features and its view-based approach, the goal of
the system was not to be a descriptive model of an actual animal visual system [6].
It therefore cannot be used to make testable predictions about biological visual
systems.
Figure 1-1: The HMAX model. The first layer, S1, consists of filters tuned to different
areas of the visual field, orientations (oriented bars at 0, 45, 90, and 135 degrees) and
scales. These filters are analogous to the simple cell receptive fields found in the V1
area of the brain. The C1 layer responses are obtained by performing a max pooling
operation over S1 filters that are tuned to the same orientation, but different scales
and positions over some neighborhood. In the S2 layer, the simple features from the
C1 layer (the 4 bar orientations) are combined into 2 by 2 arrangements to form 256
intermediate feature detectors. Each C2 layer unit takes the max over all S2 units
differing in position and scale for a specific feature and feeds its output into the view-tuned
units. In our new model, we replace the hard-coded 256 intermediate features
at the S2 level with features the system learns.
HMAX


HMAX models the ventral visual pathway, from the primary visual cortex (V1), the
first visual area in the cortex, to the inferotemporal cortex, an area critical to object
recognition [5]. HMAX’s structure is made up of alternating levels of S units, which
perform pattern matching, and C units, which take the max of the S level responses.

An overview of the model can be seen in Figure 1-1. The first layer, S1, consists
of filters (first derivative of Gaussians) tuned to different areas of the visual field,
orientations (oriented bars at 0, 45, 90, and 135 degrees) and scales. These filters
are analogous to the simple cell receptive fields found in the V1 area of the brain.
The C1 layer responses are obtained by performing a max pooling operation over S1
filters that are tuned to the same orientation, but different scales and positions over
some neighborhood. In the S2 layer, the simple features from the C1 layer (the 4 bar
orientations) are combined into 2 by 2 arrangements to form 256 intermediate feature
detectors. Each C2 layer unit takes the max over all S2 units differing in position
and scale for a specific feature and feeds its output into the view-tuned units.
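
The count of 256 follows from choosing one of 4 orientations independently for each of the 4 cells in a 2 by 2 arrangement (4^4 = 256). The short enumeration below makes this explicit; the tuple ordering (top-left, top-right, bottom-left, bottom-right) is an assumption chosen for illustration.

```python
from itertools import product

ORIENTATIONS = (0, 45, 90, 135)  # preferred bar orientations in degrees

# Each hard-coded S2 feature assigns one orientation to each cell of a
# 2 x 2 grid: (top-left, top-right, bottom-left, bottom-right).
s2_features = list(product(ORIENTATIONS, repeat=4))
num_features = len(s2_features)  # 4 ** 4
```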

By having this alternating S and C level architecture, HMAX can increase specificity
in feature detectors and increase invariance. The S levels increase specificity
and maintain invariance. The increase in specificity stems from the combination of
simpler features from lower levels into more complex features.

HMAX manages to increase invariance due to the max pooling operation at the
C levels. For example, suppose a horizontal bar at a certain position is presented to
the system. Since each S1 filter template matches one of four orientations at
differing positions and scales, one S1 cell will respond most strongly to this bar. If
the bar is translated, the S1 filter that responded most strongly to the horizontal bar
at the original position has a weaker response. The filter whose response is greatest to the
horizontal bar at the new position will have a stronger response. When the max is taken
over the S1 cells in the two cases, the C1 cell that receives input from all S1 filters
that prefer horizontal bars will receive the same level of input in both cases.

An alternative to taking the max is taking the sum of the responses. When taking
the sum of the S1 outputs, the C1 cell would also receive the same input from the
bar in the original position and the moved position. Since one input to C1 would
have decreased, but the other would have increased, the total response remains the
same. However, taking the sum does not maintain feature specificity when there are
multiple bars in the visual field. If a C1 cell is presented with an image containing
a horizontal and vertical bar, when summing the inputs, the response level does not
indicate whether or not there is a horizontal bar in the field. Responses to the vertical
and the horizontal bar are both included in the summation. On the other hand, if
the max is taken, the response would be of the most strongly activated input cell.
This response indicates what bar orientation is present in the image. Because max
pooling preserves bar orientation information, it is robust to clutter [10].
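
The argument can be checked on a toy example. The sketch below pools the responses of three hypothetical S1 units, all tuned to horizontal bars at neighboring positions; the integer response values are invented for illustration and do not come from the model.

```python
def pool_max(responses):
    """C1-style pooling: keep only the strongest afferent response."""
    return max(responses)


def pool_sum(responses):
    """Alternative pooling: add up all afferent responses."""
    return sum(responses)


# Hypothetical responses (arbitrary units) of three horizontal-tuned S1 units.
bar_left = [90, 10, 0]       # horizontal bar at the left position
bar_right = [0, 10, 90]      # same bar translated to the right position
clutter_only = [35, 35, 30]  # no horizontal bar; other bars weakly drive each unit

# Both pooling rules are invariant to the translation of a single bar.
translation_invariant_max = pool_max(bar_left) == pool_max(bar_right)
translation_invariant_sum = pool_sum(bar_left) == pool_sum(bar_right)

# Only the max keeps the clutter response below the response to a real bar.
sum_confuses_clutter = pool_sum(clutter_only) == pool_sum(bar_left)
max_separates_clutter = pool_max(clutter_only) < pool_max(bar_left)
```

With these numbers the summed response to clutter equals the summed response to a genuine horizontal bar, while the max response still separates the two cases.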

The HMAX architecture is based on experimental findings on the ventral visual
pathway and is consistent with results from physiological experiments on the primate
visual system. As a result, it is a good biological model for making testable
predictions.


1.2  Motivation 

The  motivation  for  my  research  is  two-fold.  On  the  computational  neuroscience  side, 
previous  experiments  with  biological  models  have  mostly  been  with  single  objects  on 
a  blank  background,  which  do  not  simulate  realistic  viewing  conditions.  By  using 
HMAX  on  face  detection,  we  are  testing  out  a  biologically  plausible  model  of  object 
recognition  to  see  how  well  it  performs  on  a  real  world  task. 

In addition, in HMAX, the intermediate features are hard-coded into the model
and learning only occurs from the C2 level to the view-tuned units. The original
HMAX model uses the same features for all object classes. Because these features are
2 by 2 combinations of bar orientations, they may work well for paperclip-like objects
[10], but not for natural images like faces. When detecting faces in an image with
background clutter, these generic features do not differentiate between the face and
the background clutter. For a face on clutter, some features might respond strongly
to the face while others respond strongly to the clutter, since the features are specific
to neither. If the responses to clutter are stronger than the ones to faces, when taking
the maximum activation over all these features, the resulting activation pattern will
signal the presence of clutter instead of a face. Therefore these features perform
badly in face detection. The extension to HMAX permits learning of features
specific to the object class and explores learning at lower stages in the visual system.
Since these features are specific to faces, even in the presence of clutter they
will have a greater activation to faces than to the cluttered parts of the images. When taking
the maximum activation over these features, the activation pattern will be robust
to clutter and still signal the presence of a face. Using class-specific features should
therefore improve performance in cluttered images.

For computer vision, this system can give some insight into how to improve current
object recognition algorithms. In general, computer vision algorithms use a centralized
approach to account for translation and scale variation in images. To achieve
translation invariance, a global window is scanned over the image to search for the
target object. To normalize for scale, the image is replicated at different scales, and
each copy is searched in turn. In contrast, the biological model uses distributed
processing through local receptive fields, whose outputs are pooled together. The
pooling builds up translation and scale invariance in the features themselves, allowing
the system to detect objects in images of different scales and positions without
having to preprocess the image.


1.3  Roadmap 

Chapter 2 explains the basic face detection task and the HMAX with feature learning
architecture, and analyzes results from simulations varying system parameters.
Performance from these experiments is then compared to the original HMAX. Chapter
3 presents results from testing the scale and translation invariance of HMAX with
feature learning. Next, in Chapter 4, I investigate unsupervised learning of features.
Chapter 5 presents results from using a biologically plausible classifier with the system.
Chapter 6 contains conclusions and discussion of future work.
Chapter 2


Basic  Face  Detection 


In this chapter, we discuss the basic HMAX with feature learning architecture, compare
its performance to standard (original) HMAX, and present results on parameter
dependence experiments.


2.1  Face  Detection  Task 

Each system (i.e. standard HMAX and HMAX with feature learning) is trained on a
reduced data set similar to [2] consisting of 200 synthetic frontal face images generated
from 3D head models [18] and 500 non-face images that are scenery pictures. The
test sets consist of 900 “synthetic faces”, 900 “cluttered faces”, and 179 “real faces”.
The “synthetic faces” are face images generated from 3D head models
[18] that are different from those used in training but are synthesized under similar illumination
conditions. The “cluttered faces” are the “synthetic faces” set, but with the non-face
images as background. The “real faces” are real frontal faces from the CMU PIE face
database [13] presenting untrained extreme illumination conditions. The negative test
set consists of 4,377 background images considered in [1] to be a difficult non-face set. We
decided to test on a non-face set of a different type from the training non-face set
because we wanted to test using non-faces that could possibly be mistaken for faces.
Examples for each set are given in Figure 2-1.
Figure 2-1: Typical stimuli used in our experiments. From left to right: Training faces
and non-faces, “cluttered (test) faces”, “difficult (test) faces” and test non-faces.

2.2  Methods 

2.2.1  Feature  Learning 

To obtain class-specific features, the following steps are performed (the steps are
shown in Figure 2-3): (1) Obtain C1 activations of training images using HMAX. Figure
2-2 shows example C1 activations from faces and non-faces. (2) Extract patches
from training faces at the C1 layer level. The locations of the patches are randomized
with each run. There are two parameters that can vary at this step: the patch size
p and the number of patches m extracted from each face. Each patch is a p x p x 4
pattern of C1 activation w, where the last 4 comes from the four different preferred
orientations of C1 units. (3) Obtain the set of features u by performing k-means,
a clustering method [3], on the patches. K-means groups the patches by similarity.
The representative patches from each group are chosen as features, the number of
which is determined by another parameter n. These features replace the intermediate
S2 features in the original HMAX. The level in the HMAX hierarchy where feature
learning takes place is indicated by the arrow in Figure 1-1. In all simulations, p varied
between 2 and 20, n varied between 4 and 3,000, and m varied between 1 and 750.
These S2 units behave like Gaussian RBF units and compute a function of the squared
distance between an input pattern x and the stored prototype u:
$f(x) = \exp\left(-\frac{\|x - u\|^2}{2\sigma^2}\right)$,
with $\sigma$ chosen proportional to patch size.
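
A minimal sketch of the S2 and C2 computations on a single C1 band is given below. It assumes the common Gaussian form exp(-||x - u||^2 / (2 sigma^2)) for the RBF response, a proportionality constant of 0.5 between sigma and p, and a randomly generated C1 map; none of these values are taken from the thesis. The k-means step is omitted, so the prototype here is simply one extracted patch standing in for a learned cluster center, and pooling over scales is left out.

```python
import math
import random


def extract_patch(c1, row, col, p):
    """Flatten the p x p x 4 block of C1 activations with top-left corner (row, col)."""
    return [v
            for r in range(row, row + p)
            for c in range(col, col + p)
            for v in c1[r][c]]


def s2_response(patch, prototype, sigma):
    """Gaussian RBF of the squared distance between a patch and a stored prototype."""
    sq_dist = sum((x - u) ** 2 for x, u in zip(patch, prototype))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def c2_response(c1, prototype, p, sigma):
    """Max of the S2 response over all patch positions in one C1 band."""
    rows, cols = len(c1), len(c1[0])
    return max(
        s2_response(extract_patch(c1, r, c, p), prototype, sigma)
        for r in range(rows - p + 1)
        for c in range(cols - p + 1)
    )


rng = random.Random(0)
# Hypothetical 6 x 6 C1 map with 4 orientation channels per position.
c1_map = [[[rng.random() for _ in range(4)] for _ in range(6)] for _ in range(6)]

p = 2
sigma = 0.5 * p  # "proportional to patch size"; the constant 0.5 is an assumption
prototype = extract_patch(c1_map, 3, 1, p)  # stand-in for a learned feature u
score = c2_response(c1_map, prototype, p, sigma)
```

Because the prototype was cut from the same map, the best-matching position has zero distance and the C2 value reaches its maximum of 1. On an image that does not contain the prototype pattern, the value would fall toward 0 as the best match gets worse.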
Figure 2-2: Typical stimuli and associated responses of the C1 complex cells (4 orientations).
Top: Sample synthetic face, cluttered face, real face, non-faces. Bottom:
The corresponding C1 activations to those images. Each of the four subfigures in the
C1 activation figures maps to one of the four bar orientations (clockwise from top left: 0,
45, 135, 90 degrees). For simplicity, only the response at one scale is displayed. Note
that an individual C1 cell is not particularly selective either to face or to non-face
stimuli.
[Figure 2-3 panel label: RBF-like neurons tuned to face parts]

Figure 2-3: Sketch of the HMAX model with feature learning: Patterns on the model
“retina” are first filtered through a continuous layer S1 (simplified on the sketch) of
overlapping simple cell-like receptive fields (first derivative of Gaussians) at different
scales and orientations. Neighboring S1 cells in turn are pooled by C1 cells through
a MAX operation. The next S2 layer contains the RBF-like units that are tuned to
object parts and compute a function of the distance between the input units and the
stored prototypes (p = 4 in the example). On top of the system, C2 cells perform
a MAX operation over the whole visual field and provide the final encoding of the
stimulus, constituting the input to the classifier. The difference to standard HMAX lies
in the connectivity from the C1→S2 layer: While in standard HMAX these connections
are hardwired to produce 256 2 x 2 combinations of C1 inputs, they are now learned
from the data. (Figure adapted from [12])


2.2.2  Classification 


After HMAX encodes the images by a vector of C2 activations, this representation is
used as input to the classifier. The system uses a Support Vector Machine (SVM) [17]
classifier, a learning technique that has been used successfully in recent machine
vision systems [2]. It is important to note that this classifier was not chosen for
its biological plausibility, but rather as an established classification back-end that
allows us to compare the quality of the different feature sets for the detection task
independent of the classification technique.

2.3  Results 

2.3.1 Comparison to Standard HMAX and Machine Vision System

As we can see from Fig. 2-4, the performance of the standard HMAX system on the face
detection task is close to chance: The system does not generalize well to faces
with similar illumination conditions that include a background (“cluttered faces”) or
to faces in untrained illumination conditions (“real faces”). This indicates that the
generic features in standard HMAX are insufficient to perform robust face detection.
The 256 features cannot be expected to show any specificity for faces vs. background
patterns. In particular, for an image containing a face on a background pattern, some
S2 features will be most activated by image patches belonging to the face. But, for
other S2 features, a part of the background might cause a stronger activation than
any part of the face, thus interfering with the response that would have been caused
by the face alone. This interference leads to poor generalization performance, as
shown in Fig. 2-4.

To illustrate the feature quality of the new model vs. standard HMAX, we
compared the average C2 activations on test images (synthetic faces and non-faces)
using standard HMAX’s 256 hard-coded features and 200 face-specific features. As
shown in Fig. 2-5, with the learned features the average activations are linearly
separable, with the faces having higher activations than the non-faces. In contrast,
with the hard-coded features, the activations for faces fall in the same range as those
for non-faces, making it difficult to separate the classes by activation.

Figure 2-4: Comparison between the new model using object-specific learned features
and the standard HMAX by test set. For the synthetic and cluttered face test sets, the
best set of features had parameters p = 5, n = 480, m = 120. For the real face test set,
the best set of features had p = 2, n = 500, m = 125. The new model generalizes
well on all sets and outperforms standard HMAX.


2.3.2  Parameter  Dependence 

Fig. 2-7 shows the dependence of the model’s performance on patch size p and the
percentage of face area covered by the features (the area taken up by one feature, p^2,
times the number of patches extracted per face, m, divided by the area covered by
one face). As the percentage of the face area covered by the features increases, the
overlap between features should in principle increase. Features of intermediate sizes
work best for “synthetic” and “cluttered” faces, while smaller features are better
for “real” faces. Intermediate features work best for detecting faces that are similar
to the training faces for two reasons. First, compared with larger features, they
probably have more flexibility in matching a greater number of faces. Second, compared
with smaller features, they are probably more selective for faces. These results are in
good agreement with [16], where gray-value features of intermediate sizes were shown to
have higher mutual information. When the training and test sets contain different types
of faces, such as synthetic vs. real faces, larger features are less able to generalize
to real faces. Smaller features work best for real faces because they capture the least
amount of detail specific to face type.

Figure 2-5: Average C2 activation of the synthetic test face and test non-face sets.
Left: using standard HMAX features. Right: using features learned from synthetic faces.
The horizontal axis of both panels is image number.

Performance as a function of the number of features n first rises with
increasing numbers of features, due to the increased discriminatory power of the feature
dictionary. At some point, however, performance levels off. With smaller features (p =
2, 5), the leveling-off point occurs at a larger n than for larger features. Because small
features are less specific to faces, when there are few of them the activation
patterns of faces and non-faces are similar. With a more populated feature space for
faces, the activation pattern becomes more specific to faces. For large features,
such as 20x20 features, which almost cover an entire face, a feature set of one will
already have a strong preference for similar faces, so increasing the number
of features has little effect. Fig. 2-6 shows performance for p = 2, 5, 7, 10, 15, 20,
m = 100, and n = 25, 50, 100, 200, 300.

Footnote: The 5x5 and 7x7 features, for which performance is best, correspond to cells’
receptive fields of about a third of a face.


Figure 2-6: Performance (ROC area) of features learned from synthetic faces with
respect to the number of learned features n and p (fixed m = 100). Performance increases
with the number of learned features up to a certain level and then levels off. Top left:
system performance on synthetic test set. Top right: system performance on cluttered
test set. Bottom: performance on real test set.

Figure 2-7: Performance (ROC area) with respect to % face area covered and p.
Intermediate size features performed best on synthetic and cluttered sets; small features
performed best on real faces. Top left: system performance on synthetic test set. Top
right: system performance on cluttered test set. Bottom: performance on real test
set.

Chapter 3


Invariance in HMAX with Feature Learning


In physiological experiments on monkeys, cells in the inferotemporal cortex
demonstrated some degree of translation and scale invariance [4]. Simulation results have
shown that the standard HMAX model exhibits scale and translation invariance [9],
consistent with the physiological results. This chapter examines invariance in the
performance of the new model, HMAX with feature learning.

3.1  Scale  Invariance 

Scale invariance is a result of the pooling at the C1 and C2 levels of HMAX. Pooling
at the C1 level is performed in four scale bands. Bands 1, 2, 3, and 4 have filter
standard deviation ranges of 1.75-2.25, 2.75-3.75, 4.25-5.25, and 5.75-7.25 pixels and
spatial pooling ranges over neighborhoods of 4x4, 6x6, 9x9, and 12x12 cells, respectively.
At the C2 level, the system pools over S2 activations of all bands to get the maximum
response.
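
As a rough illustration of the spatial pooling at the C1 level, the following Python sketch takes the maximum over square neighborhoods of an S1 response map, using the band-specific neighborhood sizes listed above. It is a simplification under stated assumptions: the neighborhoods are treated as non-overlapping tiles, pooling over the filter scales within a band is omitted, and the names `POOLING_SIZE` and `c1_pool` are hypothetical rather than taken from the HMAX implementation.

```python
# Spatial pooling neighborhood (in S1 cells) for each scale band, from the text.
POOLING_SIZE = {1: 4, 2: 6, 3: 9, 4: 12}


def c1_pool(s1_map, size):
    """Max-pool a 2D S1 response map over size x size tiles.

    Assumption: tiles do not overlap; edge tiles may be smaller.
    """
    rows, cols = len(s1_map), len(s1_map[0])
    pooled = []
    for r in range(0, rows, size):
        pooled.append([
            max(s1_map[i][j]
                for i in range(r, min(r + size, rows))
                for j in range(c, min(c + size, cols)))
            for c in range(0, cols, size)
        ])
    return pooled


# Toy 8x8 S1 map whose value at (row, col) is row * 8 + col.
toy_s1 = [[row * 8 + col for col in range(8)] for row in range(8)]
band1_c1 = c1_pool(toy_s1, POOLING_SIZE[1])  # 4x4 tiles give a 2x2 C1 map
```

Larger bands use larger tiles, so each C1 cell summarizes a larger region of the image.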

In  the  simulations  discussed  in  the  previous  chapter,  the  features  were  extracted 
at  band  2,  and  the  C2  activations  were  a  result  of  pooling  over  all  bands.  In  this 
section,  we  wish  to  explore  how  each  band  contributes  to  the  pooling  at  the  C2  level. 
As the band scale increases, the area of the image that a receptive field covers increases.

Figure 3-1: C1 activations of face and non-face at different scale bands. Top (from
left to right): Sample synthetic face, C1 activation of face at band 1, band 2, band
3, and band 4. Bottom: Sample non-face, C1 activation of non-face at band 1, band
2, band 3, and band 4. The four subfigures in each C1 activation figure correspond
to the four bar orientations (clockwise from top left: 0, 45, 135, 90 degrees).


Example C1 activations at each band are shown in Fig. 3-1. Our hypothesis is that
as face size changes, the band most tuned to that scale will “take over” and become
the maximum responding band.


Figure 3-2: Example images of rescaled faces. From left to right: training scale, test
face rescaled -0.4 octave, test face rescaled +0.4 octave.


In the experiment, features are extracted from synthetic faces at band 2, then the
system is trained using all bands. The system is then tested on synthetic faces on
a uniform background, resized from 0.5 to 1.5 times the training size (Fig. 3-2), using
bands 1-4 individually at the C2 level and also pooling over all bands. The test
non-face sets are kept at normal size, but are pooled over the same bands as their
respective face test sets. The rescale range of 0.5-1.5 was chosen to test bands
a half-octave above and an octave below the training band.


Figure 3-3: ROC area vs. log of rescale factor. Trained on synthetic faces, tested on
900 rescaled synthetic test faces. Image size is 100x100 pixels.

As shown in Fig. 3-3, for small faces, the system at band 1 performs the best of
all the bands. As face size increases, performance at band 1 drops and band 2 takes
over to become the dominant band. At band 3, system performance also increases
as face size increases. At large face sizes (1.5 times training size), band 3 becomes
the dominant band while band 2 starts to decrease in performance. Band 4 has
poor performance for all face sizes. Its receptive fields are an octave above the
training band’s, so to see whether band 4 continues its upward trend in performance we
re-ran the simulations with 200x200 images and a rescale range of 0.5-2 times the
training size.
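
The relation between the rescale factors and octaves used in this section is simply a base-2 logarithm. The short sketch below works through the arithmetic for the two rescale ranges mentioned above; the helper name `octaves` is hypothetical and introduced only for illustration.

```python
import math


def octaves(rescale_factor):
    """Convert a size ratio relative to the training size into octaves."""
    return math.log2(rescale_factor)


# Endpoints of the two rescale ranges used in the experiments.
first_range = (octaves(0.5), octaves(1.5))   # an octave below, roughly half an octave above
second_range = (octaves(0.5), octaves(2.0))  # an octave below, an octave above

# Size ratios corresponding to the -0.4 and +0.4 octave examples in Fig. 3-2.
smaller = 2 ** -0.4  # about 0.76 times the training size
larger = 2 ** 0.4    # about 1.32 times the training size
```

A factor of 1.5 is therefore slightly more than half an octave above the training size, while extending the range to a factor of 2 reaches a full octave.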

The average C2 activation to synthetic test faces vs. rescale amount is shown
in Fig. 3-4. The behavior of the C2 activations as image size changes is consistent
with the ROC area data above. At small sizes, band 1 has the greatest average C2
activations. As the size becomes closer to the training size, band 2 becomes the most
activated band. At large face sizes, band 3 is the most activated. For band 4, as
expected, the C2 activation increases as face size increases; however, its activation is
consistently lower than that of any other band. In this rescale range, band 4 is poor
at detecting faces. Two additional experiments could be tried: increasing the image size
and rescale range further to see whether band 4 continues this upward trend, or training
with band 3, in which case performance at band 4 should improve because bands 3 and 4
are closer in scale than bands 2 and 4.


Figure  3-4:  Average  C2  activation  vs.  log  of  rescale  factor.  Trained  on  synthetic 
faces,  tested  on  900  rescaled  synthetic  test  faces.  Image  size  is  200x200  pixels 

These results (from performance measured by ROC area and average C2 activations)
agree with the “take over” effect we expected to see. As face size decreases and
band scale is held constant, the area of the face that a C1 cell covers increases. The C1
activations of the smaller face will match poorly with the features trained at band 2.
However, when the C1 activations are taken using band 1, each C1 cell pools over a
smaller area, thereby compensating for the rescaling. Similarly, as face size increases
from the training size, each C1 cell covers less of the face. Going from band 2 to
band 3, each C1 cell pools over a larger area, again compensating for the rescaling.

When using all bands (Fig. 3-3), performance stays relatively constant for sizes
around the training size, then starts to drop off slightly at the ends. The system
has constant performance even though face size changes because the C2 responses
are pooled from all bands. As the face size varies, we see from the performance of
the system on individual bands that at least one band will be strongly activated and
signal the presence of a face. Although face scale may change, by pooling over all
bands, the system can still detect the presence of the resized face.
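
The “take over” behavior under pooling over all bands can be sketched as a maximum over per-band responses. In the Python sketch below, the function name `pool_over_bands` is hypothetical and the activation values are made-up numbers chosen only to mimic the qualitative pattern described above; they are not measurements from the experiments.

```python
def pool_over_bands(band_activations):
    """Return (maximum activation, winning band) for one feature.

    band_activations maps a band number to that band's C2 activation,
    assumed to be already maximized over positions within the band.
    """
    best_band = max(band_activations, key=band_activations.get)
    return band_activations[best_band], best_band


# Illustrative (invented) activations of one feature at three face sizes.
small_face = {1: 0.82, 2: 0.61, 3: 0.40, 4: 0.22}
training_size_face = {1: 0.55, 2: 0.90, 3: 0.63, 4: 0.30}
large_face = {1: 0.35, 2: 0.66, 3: 0.84, 4: 0.41}

for name, activations in [("small", small_face),
                          ("training size", training_size_face),
                          ("large", large_face)]:
    value, band = pool_over_bands(activations)
    print(f"{name}: max activation {value:.2f} from band {band}")
```

Because the pooled response is always the strongest single-band response, it stays high as long as some band matches the face scale.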

3.2  Translation  Invariance 

Like scale invariance, translation invariance is the result of the HMAX pooling
mechanism. From the S1 to the C1 level, each C1 cell pools over a local neighborhood of
S1 cells, the range determined by the scale band. At the C2 level, after pooling over
all scales, HMAX pools over all positions to get the maximum response to a feature.
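
A minimal sketch of why a global maximum over positions gives translation invariance is shown below. The response map is a toy example with invented values, and the helper names `shift_right` and `global_max` are hypothetical. The invariance holds exactly only while the strongest response stays inside the map after the shift.

```python
def shift_right(grid, k, fill=0.0):
    """Shift every row of a 2D list k cells to the right, padding with fill."""
    width = len(grid[0])
    return [[fill] * k + row[:width - k] for row in grid]


def global_max(grid):
    """C2-style pooling: maximum response over all positions."""
    return max(max(row) for row in grid)


# Toy S2 response map with one strong response (values are illustrative).
s2_map = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 0.9, 0.3, 0.0, 0.0],
    [0.1, 0.3, 0.2, 0.0, 0.0],
]
shifted_map = shift_right(s2_map, 2)

# The pooled response is the same before and after the shift.
same_response = global_max(s2_map) == global_max(shifted_map)
```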


Figure  3-5:  Examples  of  translated  faces.  From  left  to  right:  training  position,  test 
face  shifted  20  pixels,  test  face  shifted  50  pixels 

To test translation invariance, we trained the system on 200x200 pixel faces and
non-faces. The training faces are centered frontal faces. For the face test set, we
translated the images 0, 10, 20, 30, 40, and 50 pixels either up, down, left, or right.
Example training and test faces can be seen in Fig. 3-5.

From the results of this experiment (Fig. 3-6), we can see that performance stays
relatively constant as face position changes, demonstrating the translation invariance
property of HMAX.

Figure 3-6: ROC area vs. translation amount. Trained on 200 centered synthetic
faces, tested on 900 translated synthetic test faces.

Chapter 4


Exploring  Features 


In the previous experiments, the system has been trained using features extracted only
from faces. However, training with features from synthetic faces on a blank background
does not reflect the real-world learning situation, where there are imperfect training
stimuli consisting of both the target class and distractor objects. In this chapter, I
explore (1) training with more realistic feature sets, and (2) selecting “good” features
from these sets to improve performance.

4.1  Different  Feature  Sets 

The  various  feature  sets  used  for  training  are: 

1. “face only” features - from synthetic faces with blank background (the same set
used in previous chapters, mentioned here for comparison)

2. “mixed” features - from synthetic faces with blank background and from
non-faces (equal numbers of face and non-face patches fed into k-means to get the
feature set)

3. “cluttered” features - from cluttered synthetic faces (training set size of 900)

4. “mixed cluttered” features - from both cluttered synthetic faces and non-faces
(equal numbers of cluttered face and non-face patches fed into k-means to get the
feature set)

5. features from real faces (training set size of 42)
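
The construction of a “mixed” feature set can be sketched as pooling equal numbers of face and non-face patches and clustering them with k-means. The Python sketch below is an illustration under assumptions, not the implementation used here: each patch is taken to be a flattened list of C1 values, k-means is the plain Lloyd's algorithm with random initial centers, and the names `kmeans` and `mixed_feature_set` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns k cluster centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def mixed_feature_set(face_patches, nonface_patches, n_features, seed=0):
    """Pool equal numbers of face and non-face patches, then cluster them."""
    count = min(len(face_patches), len(nonface_patches))
    rng = random.Random(seed)
    pooled = rng.sample(face_patches, count) + rng.sample(nonface_patches, count)
    return kmeans(pooled, n_features, seed=seed)


# Toy two-dimensional "patches" (invented values, for illustration only).
toy_faces = [[1.0, 0.9], [0.9, 1.0], [1.0, 1.0], [0.8, 0.9]]
toy_nonfaces = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1]]
toy_features = mixed_feature_set(toy_faces, toy_nonfaces, 2)
```

Because the clustering sees both kinds of patches without labels, some of the resulting centers may end up tuned to non-face structure.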


For each simulation, the training faces used correspond to the feature set used.
For example, when training using “mixed cluttered” features, cluttered faces are used
as the training face set for the classifier. The test sets are the same as those used
for the system described in Chapter 2: 900 synthetic faces, 900 cluttered faces, 179
real faces, and 4,377 non-faces.

The performance of the feature sets is shown in Fig. 4-1. For all feature sets, the
test face set most similar to the training set performed best. This result makes sense
since the most similar test set would have the same distribution of C2 activations as
the training set.

Figure 4-1: Performance of features extracted from synthetic, cluttered, and real
training sets, tested on synthetic, cluttered, and real test sets using the SVM classifier.

“Mixed” features perform worse than face only features. Since these features
consist of face and non-face patches, they are no longer as discriminatory
for faces. Faces respond poorly to the non-face tuned features, while non-faces are more
activated. Looking at the training sets’ C2 activations using “mixed” features (Fig. 4-2),
we see that the average C2 activation of synthetic faces decreases compared to
the average C2 activation using face only features, while the average C2 activation of
non-faces increases. As a result, the two classes are not as easily separable, accounting
for the poor performance. To improve performance, feature selection is explored in
the next section.


Figure 4-2: Average C2 activation of training sets. Left: using face only features.
Right: using mixed features. The horizontal axis of both panels is image number.

“Mixed cluttered” features also display poor performance for the cluttered face test
set, although performance on real faces is better than when trained on cluttered
features. To explore the reason behind these results, we have to examine the features
themselves: what is the distribution of “good” features (ones that are better at
distinguishing between faces and non-faces) and “bad” features? One way to
measure how “good” a feature is, is to calculate its ROC area (referred to below simply
as its ROC). Figures 4-3 to 4-6 show the distribution of features by ROC for feature
sets 1-4.
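
A per-feature ROC score of this kind can be sketched as follows, assuming that the ROC of a feature means the area under the ROC curve obtained when that feature's C2 activation alone is used as a face score. The function name `roc_area` and the example activations are hypothetical. The sketch uses the equivalent pairwise formulation: the area equals the fraction of (face, non-face) pairs in which the face has the higher activation.

```python
def roc_area(face_scores, nonface_scores):
    """Area under the ROC curve for a single feature.

    Equals the probability that a randomly chosen face scores higher than
    a randomly chosen non-face, with ties counted as one half.
    """
    wins = 0.0
    for f in face_scores:
        for n in nonface_scores:
            if f > n:
                wins += 1.0
            elif f == n:
                wins += 0.5
    return wins / (len(face_scores) * len(nonface_scores))


# Invented C2 activations of one feature on three faces and three non-faces.
face_activations = [0.9, 0.8, 0.7]
nonface_activations = [0.6, 0.75, 0.2]
feature_roc = roc_area(face_activations, nonface_activations)  # 8 of 9 pairs
```

Under this reading, a value near 1 marks a good face detector, a value near 0 marks a feature that responds more to non-faces, and 0.5 corresponds to chance.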

Mixed feature sets (“mixed”, “mixed cluttered”) have more features with low
ROCs than pure face feature sets (“face only”, “cluttered”), but fewer features with
high ROCs. If we take a low ROC to mean that a feature is a good non-face
detector, then including non-face patches produces features tuned to non-faces. In
Fig. 4-6, when using “cluttered” features vs. “mixed cluttered” features on real faces,
both sets have very few good face detectors, as indicated by the absence of high ROC
features. However, the “mixed cluttered” set has more features tuned to non-faces.
Having more non-face features may be a reason why “mixed cluttered” performs better on
real faces: these features can better distinguish non-faces from real faces.

Figure 4-3: ROC distribution of feature sets when calculated over their respective
training sets

We compare our system trained on real faces with other face detection systems:
the component-based system described in [2], and a whole face classifier [7]. HMAX
with feature learning performs better than these machine vision systems (Fig. 4-7). Some
possible reasons for the better performance are: (1) our system uses real faces to train,
while the component-based system uses synthetic faces, so our features are more tuned
to real faces; (2) our features are constructed from C1 units, while the
component-based system’s features are pixel values, and our features, along with HMAX’s
hierarchical structure, generalize better to images in different viewing
conditions; (3) the component-based system uses an SVM classifier to learn features
while our system uses k-means. The SVM requires a large number of training examples
in order to find the best separating hyperplane. Since we only train with 42 faces, we
should expect the computer vision system’s performance to improve if we increase the
training set size. The whole face classifier is trained on real faces and uses a whole
face template to detect faces. From these results, it seems that face parts are more
flexible to variations in faces than a face template.

Figure 4-4: ROC distribution of feature sets when calculated over synthetic face set

Figure 4-5: ROC distribution of feature sets when calculated over cluttered face set

Figure 4-6: ROC distribution of feature sets when calculated over real face set

Figure 4-7: Comparison of HMAX with feature learning, trained on real faces and
tested on real faces, with computer vision systems.

4.2  Feature Selection


In training using the “mixed” and “mixed cluttered” feature sets, we are taking a step
toward unsupervised learning. Instead of training only on features from the target
class, the system is given features that may come from faces or non-faces.

We try to improve the performance of these mixed feature sets by selecting a subset of
“good” features. We apply the following methods to select features:

1. highest ROC - pick features that have high hit rates for faces and low false
alarm rates for non-faces

2. highest and lowest ROC - features that are good face or non-face detectors,
chosen by taking the features with ROC farthest from 0.5

3. highest average C2 activation - high C2 activations on training faces may be
equivalent to good face detecting features [10]

4. mutual information - pick out features that contribute the most information
to deciding image class. Mutual information for a feature is calculated by

$MI(C, X) = \sum_{c} \sum_{x} p(c, x) \log \frac{p(c, x)}{p(x)\, p(c)}$

where C is the class (face or non-face) and X is the feature (value ranges from 0 to 1).
This feature selection method has been used in computer vision systems [16]. Note: in
the algorithm, X takes on discrete values. Since the responses to a feature can take on
a continuous value between 0 and 1, we discretized the responses, flooring the value to
the nearest tenth.

5. random - baseline performance to compare with the above methods (averaged over
five iterations)
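
The mutual information criterion in method 4, together with the flooring to the nearest tenth, can be sketched as below. This is an illustration under assumptions: probabilities are estimated as simple relative frequencies, the logarithm is taken base 2 (so the result is in bits; the base is not stated in the text), and the names `tenth_bin` and `mutual_information` are hypothetical.

```python
import math
from collections import Counter


def tenth_bin(response):
    """Discretize a response in [0, 1] by flooring to a tenth (bin index 0..10)."""
    return min(int(response * 10), 10)


def mutual_information(responses, labels):
    """MI (in bits) between a feature's discretized responses and class labels."""
    total = len(responses)
    bins = [tenth_bin(r) for r in responses]
    joint = Counter(zip(labels, bins))
    class_counts = Counter(labels)
    bin_counts = Counter(bins)
    mi = 0.0
    for (c, x), count in joint.items():
        p_cx = count / total
        p_c = class_counts[c] / total
        p_x = bin_counts[x] / total
        mi += p_cx * math.log2(p_cx / (p_x * p_c))
    return mi


# Invented responses: label 1 is a face, label 0 is a non-face.
labels = [1, 1, 0, 0]
informative = mutual_information([0.95, 0.91, 0.12, 0.15], labels)
uninformative = mutual_information([0.95, 0.12, 0.91, 0.15], labels)
```

In the first example the discretized response determines the class completely, giving one bit of information; in the second the response bins are spread evenly across the classes, giving zero.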

Results of applying the five feature selection techniques to “mixed” features and
“mixed cluttered” features to get the 100 best features are shown in Fig. 4-8 and
Fig. 4-9, respectively.

Figure 4-8: Performance of feature selection on “mixed” features. Left: for cluttered
face set. Right: for real face set. In each figure, ROC area of performance with (from
left to right): face only features, all mixed features, highest and lowest ROC, only
highest ROC, average C2 activation, mutual information, and random selection. ROC areas
are given at the top of each bar.

In all the feature selection results, picking features by highest ROC alone (method
1) performed better than picking by highest and lowest ROC (method 2). We conclude that
selecting by highest ROC, even though it may include features with ROCs around chance,
works better than including low ROC features at the cost of having fewer high ROC
features. Although we saw in the previous section that good non-face features did help
performance for the real face set, in that case there were very few face features, so
good non-face features had more impact. Compared with having more face features,
non-face features seem to be less important. There seem to be two possible reasons for
this result, both of which come from comparing the ROC of features on the training sets
versus the test sets (Fig. 4-10 and Fig. 4-11). First, when picking features by method
1, we get some features that have ROCs around chance for the training set but high ROCs
for the test sets. If we use method 2, these features are not picked. Second, the
training and test non-face sets contain different types of non-faces. The former
consists of scenery pictures, while the latter consists of hard non-faces as deemed by
an LDA classifier [1]. The features tuned to the training non-face set may perform
poorly on the test non-face set. In Fig. 4-11, we see that the training and test feature
ROCs are less correlated for low ROCs than for high ROCs, showing that non-face
detectors do not generalize well.

Figure 4-9: Performance of feature selection on “mixed cluttered” features. Top left:
for synthetic face set. Top right: for cluttered face set. Bottom: for real face set. In
each figure, ROC area of performance with (from left to right): face only features, all
mixed features, highest and lowest ROC, only highest ROC, average C2 activation,
mutual information, and random selection. ROC areas are given at the top of each bar.

In the “mixed” feature set, for cluttered faces, selection by highest ROC value
performed the best, almost as well as face only features. For real faces, feature
selection by average C2 activation performed the best. Also, in the “mixed cluttered”
feature set, selection by average C2 activation performed the best of all the methods
for all test sets. Since the activations were averaged over face responses only, picking
the features with the highest response to faces might translate into good face detectors
that are robust to clutter [10].

The mutual information (MI) values of the “mixed” features calculated using the
training set have low correlation (all less than 0.1) with the ones calculated using the
test sets. The features that have high MI for the training set may or may not have high
MI for the test sets. Therefore we do not expect the performance of feature selection
by MI to be any better than random, which is what we see in Fig. 4-8. For the “mixed
cluttered” feature set, the MI correlations between the training set and the synthetic,
cluttered, and real test sets are 0.15, 0.20, and 0.0550, respectively. The increased
correlation for the synthetic and cluttered sets may be why we see better performance
for this set (Fig. 4-9) than for “mixed” features.

Figure 4-10: Feature ROC comparison between the “mixed” features training set and
test sets. Left: Feature ROC taken over training set vs. cluttered faces and non-face
test sets. Right: Feature ROC taken over training set vs. real faces and non-face test
sets.


4.3  Conclusions 

In this chapter, we explored using unsupervised learning to obtain features, then
selecting “good” features to improve performance. The results have shown that selecting
only good face features (using methods such as highest ROC and average
C2 activation) is more effective than selecting both good face and non-face features.
Because faces have less variability in shape than non-faces (which can be images of
buildings, cars, trees, etc.), a good non-face feature for one set of non-faces may
generalize poorly to other types of non-faces, while face feature responses are more
consistent across sets. Selecting features by average C2 activation gives us a simple,
biologically-plausible method to find good features. A face tuned cell can be created
by choosing as afferents C2 units that respond strongly to faces.

Figure 4-11: Feature ROC comparison between the “mixed cluttered” features training
set and test sets. Top left: Feature ROC taken over training set vs. synthetic face
and non-face test sets. Top right: Feature ROC taken over training set vs. cluttered
face and non-face test sets. Bottom: Feature ROC taken over training set vs. real
face and non-face test sets.

Chapter 5


Biologically  Plausible  Classifier 


In all the experiments discussed so far, the classifier used is the SVM. As touched
upon in Chapter 2, we decided to use an SVM so that we could compare the results
to other representations. However, the SVM is not a biologically-plausible classifier. In
the training stage of an SVM, in order to find the optimal separating hyperplane, the
classifier has to solve a quadratic programming problem. This problem cannot easily
be solved by neural computation. In the following set of experiments, we replace the
SVM with a simpler classification procedure.

5.1  Methods 

After obtaining the C2 activations of the training face set, we use k-means (the same
algorithm used for getting features from patches) on the C2 activations to get “face
prototypes” (FP). Instead of creating representative face parts, we are now getting
representative whole faces, each encoded by an activation pattern over C2 units. The
number of face prototypes is a parameter, which was varied from 1 to 30 in these
simulations. To classify an image, the system takes the Euclidean distance between
the image’s C2 activation vector and the C2 activation vector of each face prototype.
The minimum distance over all the face prototypes is recorded as that image’s likeness
to a face. The face prototypes can be thought of as RBF-like face units, so the minimum
distance to the prototypes is equivalent to the maximum activation over all the face
units. Our hypothesis is that face images will have C2 activation patterns similar to
those of the face prototypes, so their maximum activation will be larger than that of
non-face images. To distinguish faces from non-faces, the system can then set a
threshold on the maximum activation, where anything above the threshold is a face and
anything below it is a non-face, creating a simple classifier. In these experiments, the
classifier did not have a fixed threshold. To measure performance, we varied the
threshold to produce an ROC curve.

Since k-means initializes its centers randomly, for all experiments in this chapter,
we average the results over five runs.
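
The classification procedure described in this section can be sketched as follows. The sketch assumes the face prototypes are already available (in the procedure above they would come from k-means on the training faces' C2 vectors); the function names `face_likeness` and `roc_points` and all numeric values are hypothetical. The negative minimum distance stands in for the maximum activation of RBF-like face units, and an image is counted as a face when its score is at or above the threshold.

```python
import math


def face_likeness(c2_vector, prototypes):
    """Negative of the minimum Euclidean distance to any face prototype.

    Larger values mean "more face-like", mirroring the maximum activation
    over RBF-like face units.
    """
    return -min(math.dist(c2_vector, p) for p in prototypes)


def roc_points(face_scores, nonface_scores):
    """Sweep the threshold over all observed scores.

    Returns a list of (false alarm rate, hit rate) pairs.
    """
    thresholds = sorted(set(face_scores) | set(nonface_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        hits = sum(s >= t for s in face_scores) / len(face_scores)
        false_alarms = sum(s >= t for s in nonface_scores) / len(nonface_scores)
        points.append((false_alarms, hits))
    return points


# Invented two-dimensional C2 vectors, for illustration only.
prototypes = [[1.0, 1.0], [0.8, 0.6]]
faces = [[1.0, 0.9], [0.7, 0.6]]
nonfaces = [[0.0, 0.0], [0.1, 0.3]]

face_scores = [face_likeness(v, prototypes) for v in faces]
nonface_scores = [face_likeness(v, prototypes) for v in nonfaces]
curve = roc_points(face_scores, nonface_scores)
```

Each threshold yields one point of the ROC curve, so no single operating threshold has to be fixed in advance.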


5.2  Results 

5.2.1  Face  Prototype  Number  Dependence 

We  varied  the  number  of  face  prototypes  from  1  to  30  on  all  five  training  sets  to  see 
how  performance  changes  as  the  number  of  face  prototypes  increased.  The  results 
are  shown  in  Fig.  5-1. 

Performance does not vary greatly when training on cluttered and synthetic faces.
With real faces, however, a single prototype gives the best performance; as the
number of prototypes increases, the ROC area drops sharply and then levels off.

The face prototypes cover the C2 unit space for faces. As the number of prototypes
increases, the space is covered better. What the C2 unit space for faces and
non-faces looks like will determine what effect increasing coverage has on the
performance of k-means as a classifier. Looking at the distribution of the average
C2 activation of the training sets might give us a clue (Fig. 5-2). For feature sets
trained on synthetic and cluttered faces, most of the average face C2 activations for
these features cluster around the mean activation, with the distribution falling off as
the activations get farther from the mean. Therefore, the first face prototypes capture
the majority of the faces. As the number of face prototypes increases, additional
prototypes capture the outliers. These prototypes might increase performance by covering
more of the C2 unit space, but they might also decrease performance if they are closer
to non-faces. However, since outliers are few, performance does not fluctuate greatly
as a result.

The distribution of the average C2 activations of the real training faces is similar to
that of the other training sets. Its standard deviation is higher than that of the other
sets, which indicates that the outliers are farther away. Additional face prototypes are
then farther from the mean as well, potentially capturing more non-faces and decreasing
performance. However, taking the average does not give us much insight into what is
happening in the feature space because it reduces the whole space to one number.
One possible solution is to reduce the high-dimensional feature space to a space small
enough for the data to be visualized while still maintaining the space’s structure.
For now, we can only speculate that the feature space is shaped such that the additional
face prototypes capture the face outliers but also non-faces.
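
As one illustration of such a reduction, the sketch below projects feature vectors onto their first two principal components. The choice of principal component analysis is an assumption made here for illustration; the text does not commit to a particular method, and a linear projection is not guaranteed to preserve the structure of the C2 space. The function names and the toy data are hypothetical, and the power iteration with deflation is a simple stand-in for a library eigen-solver.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(rows):
    """Subtract the per-feature mean from every row."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def leading_axis(matrix, iters=200, seed=0):
    """Power iteration: dominant eigenvector of a symmetric matrix."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in matrix]
    for _ in range(iters):
        w = [dot(row, v) for row in matrix]
        norm = dot(w, w) ** 0.5
        v = [x / norm for x in w]
    return v

def pca_project(rows, n_components=2):
    """Return the principal axes and the rows projected onto them."""
    centered = center(rows)
    residual = covariance(centered)
    axes = []
    for c in range(n_components):
        v = leading_axis(residual, seed=c)
        lam = dot(v, [dot(row, v) for row in residual])  # variance along v
        # Deflation: remove the variance already explained by v.
        residual = [[residual[i][j] - lam * v[i] * v[j] for j in range(len(v))]
                    for i in range(len(v))]
        axes.append(v)
    return axes, [[dot(row, v) for v in axes] for row in centered]

# Toy 3-D vectors (hypothetical) that vary mostly along the direction (1, 1, 0).
rows = [[-2.0, -2.0, 0.1], [-1.0, -1.0, -0.1], [0.0, 0.0, 0.0],
        [1.0, 1.0, -0.1], [2.0, 2.0, 0.1]]
axes, projected = pca_project(rows)
```

Each entry of `projected` is a two-dimensional point that could be plotted, with faces and non-faces in different colors, to inspect where additional prototypes fall relative to the non-faces.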
Figure 5-1: Varying number of face prototypes. Trained and tested on synthetic,
cluttered sets using k-means classifier. Panels: (a) face only features, (b) mixed
features, (c) cluttered features, (d) mixed cluttered features, (e) real face features.


Figure 5-2: Distribution of average C2 activations on training face set for different
feature types. Panels: (a) face only features, (b) mixed features, (c) cluttered
features, (d) mixed cluttered features, (e) real face features. Axis label: # of faces.
5.2.2 Using Face Prototypes on Previous Experiments

Figure 5-3: Comparing performance of SVM to k-means classifier on the four feature
types. Number of face prototypes = 10. From top left going clockwise: on face only
features, mixed features, mixed cluttered features, and cluttered features.

We re-ran the same simulations presented in the last chapter, but replaced the
SVM with the k-means classifier. Figure 5-3 compares the performance of the
SVM with that of the k-means classifier. For the face only, “mixed”, and real face
feature sets (Fig. 5-4), k-means performance is comparable to the SVM. For the cluttered
and “mixed cluttered” feature sets, k-means performs worse than the SVM. A possible
reason for the decreased performance on cluttered training sets is that k-means only
uses the training face set to classify. Face prototypes of cluttered faces might be
similar to both faces and non-faces, making the two sets hard to separate based
solely on distance to the nearest face prototype. In contrast, the SVM uses both face
and non-face information to find the best separating plane between the two.

Figure 5-4: Comparison of HMAX with feature learning (using SVM and k-means as
classifiers), trained on real faces and tested on real faces, with computer vision systems.
The k-means system used 1 face prototype.

The results for the feature selection simulations are shown in Figures 5-5 and 5-6.
The relative performance of the various feature selection methods changes when
we use k-means. For example, mutual information replaces C2 as the best method
for “mixed cluttered” features. Because k-means and the SVM are inherently different
(one uses the minimum distance to a center, the other a separating plane), it is
expected that their outcomes might differ. The two classifiers weigh the features
differently. The SVM assigns the features the weights that maximize the margin of the
separating hyperplane. In the k-means classifier, how much each feature contributes
depends on where in the feature space the nearest face prototype is. Any of the
feature selection methods could have selected both good and bad features in varying
proportions. Performance then depends on which features the classifier weights more.
Further exploration into the reasons for the different outcomes is left to future
work.
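
The difference in how the two classifiers weigh features can be made concrete with a small sketch. It assumes a linear kernel for the SVM, which the text does not specify; the weights `w`, bias `b`, prototype, and input vector are hypothetical numbers chosen only to show that the same feature can be ignored by one classifier and dominate the other.

```python
def svm_decision(x, w, b):
    """Linear decision value: positive means the 'face' side of the plane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def svm_contributions(x, w):
    """Per-feature terms of the linear decision value (fixed weights)."""
    return [wi * xi for wi, xi in zip(w, x)]

def nearest_prototype(x, prototypes):
    """Prototype with the smallest squared Euclidean distance to x."""
    return min(prototypes,
               key=lambda p: sum((xi - pi) ** 2 for xi, pi in zip(x, p)))

def prototype_contributions(x, prototypes):
    """Per-feature terms of the squared distance to the nearest prototype."""
    p = nearest_prototype(x, prototypes)
    return [(xi - pi) ** 2 for xi, pi in zip(x, p)]

# Hypothetical two-feature example.
x = [1.0, 5.0]
w, b = [2.0, 0.0], -1.0        # the plane ignores the second feature
prototypes = [[1.0, 0.0]]      # a single face prototype

svm_terms = svm_contributions(x, w)                  # [2.0, 0.0]
proto_terms = prototype_contributions(x, prototypes)  # [0.0, 25.0]
side = svm_decision(x, w, b)                          # 1.0
```

Here the second feature has no effect on the linear decision but accounts for the entire distance to the prototype, so a feature selection method that keeps this feature would affect the two classifiers very differently.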
Figure 5-5: Performance of feature selection on “mixed” features using the k-means
classifier. Left: for cluttered face set. Right: for real face set. Feature selection
methods are listed in the legend in the same notation as in Chapter 4. Panel titles:
“C2Kmeans: mixed features on cluttered face set”, “C2Kmeans: mixed features on real
face set”.
Figure 5-6: Performance of feature selection on “mixed cluttered” features using the
k-means classifier. Top: for synthetic face set. Bottom left: for cluttered face set.
Bottom right: for real face set. Feature selection methods are listed in the legend in
the same notation as in Chapter 4. Axis labels: ROC area, # of FPs.
5.3 Conclusions


The SVM’s training stage is complex and not biologically plausible, but once the
separating hyperplane is found, the classification stage is simple: a data point is
classified based on which side of the plane it is on. The k-means training stage is
biologically feasible to implement using self-organizing cortical maps. For
classification, face-tuned cells can set an activation threshold, and anything above
that threshold is labeled a face. Both the SVM and the k-means classifier have some
usability issues. The SVM requires a substantial number of training points in order
to perform well. In addition, parameters such as kernel type and data chunk size
influence performance. For the k-means classifier, performance is poorer if non-face
information is needed to classify images, for example images that contain clutter.
In addition, the number of face prototypes that gives the best performance depends
on the shape of the feature space, so the optimal number for one training set may
not apply to another.
Chapter 6


Discussion 


Computer vision systems traditionally have simple features, such as wavelets and pixel
values, along with a complex classification procedure (preprocessing to normalize for
illumination, searching for the object in images at different scales and positions)
[8, 2, 15]. In contrast, HMAX, a biological computer vision system, has complex
features but a simpler classification stage. The pooling performed in HMAX builds
scale and translation invariance into the final C2 encoding. However, HMAX’s
hard-coded features do not perform well on the face detection task.
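
The way max pooling produces translation invariance can be seen in a one-dimensional toy example. This is a deliberately simplified sketch, not the HMAX implementation: the dot-product template match stands in for an S-layer unit, the single global max stands in for a C-layer unit, and the template and signals are hypothetical.

```python
def s_layer(signal, template):
    """Template-matching responses at every position (position-sensitive)."""
    n = len(template)
    return [sum(t * s for t, s in zip(template, signal[i:i + n]))
            for i in range(len(signal) - n + 1)]

def c_unit(responses):
    """Max pooling over all positions (position-invariant)."""
    return max(responses)

template = [1, 2, 1]
image = [0, 0, 1, 2, 1, 0, 0, 0]      # pattern near the left
shifted = [0, 0, 0, 0, 1, 2, 1, 0]    # same pattern, moved two steps right

s_image = s_layer(image, template)
s_shifted = s_layer(shifted, template)

# The location of the best match moves with the pattern...
peak_image = s_image.index(max(s_image))
peak_shifted = s_shifted.index(max(s_shifted))

# ...but the pooled response is unchanged.
c_image = c_unit(s_image)
c_shifted = c_unit(s_shifted)
```

Pooling over units tuned to different scales would give scale invariance in the same way; the sketch covers position only.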

The goal of the new HMAX model was to replace the hard-coded features with
object-specific features, namely face parts, and see if performance improved. As
expected, HMAX with feature learning performed better than the standard HMAX
on the face detection task. The average C2 activations of the two types of features
show that the object-specific features are more tuned toward faces than non-faces,
while HMAX’s features have no such preference. By integrating object specificity
into the HMAX architecture, we were able to build a system whose performance is
competitive with current computer vision systems.

Additional simulations found that the new model also exhibited scale and translation
invariance. Explorations into unsupervised feature learning, feature selection,
and the use of a simple classifier gave promising results. However, more investigation
into the mechanisms underlying the results is needed in order to understand them
fully.
For future work on the HMAX model, further exploration of feature selection
and alternative classification methods is needed to turn the new model into
a fully biologically-plausible system. In our experiments, we have only looked at a
basic face detection task. Theoretically, the model should be easily extendable to
other tasks just by replacing the features. It will be interesting to apply the model
to car detection, and to even more specific recognition tasks, such as distinguishing
between individual faces using the same features we currently use. The results would
determine whether the system works well regardless of the specific recognition task.